![]() Acrobat file (254K) |
![]() ClarisWorks 4 file (60K) |
![]() QuickView file (473K) |
Technote 1008 | OCTOBER 1995 |
With the second generation of Power Macintosh computers, Apple has
transitioned the Macintosh I/O expansion bus from NuBus(TM) to PCI.
Apple's underlying policy is to support the PCI standard, as expressed
in the PCI Local Bus Specification, Revision 2.0.
The adoption of the PCI standard brings many advantages to the Macintosh platform. Arguably, one of the most significant is increased I/O bandwidth. Developers frequently ask questions on PCI bus commands with an eye toward bus performance. This Technote examines the PCI bus commands, the operation of the IB chip (the PowerPC processor to PCI interface bridge chip), achievable PCI bandwidth on PCI Power Macintosh computers, and, finally, Mac OS services available to maximize PCI bandwidth.
This Technote is written for PCI hardware designers and driver writers
who are developing for the Power Macintosh platform.
Contents
Since the IB chip competes for system memory along with other system devices, continuous PCI bursting is not possible. Therefore, the achievable PCI bandwidth on Power Macintosh computers -- a significant improvement from NuBus -- will be less than the PCI theoretical maximum. Also, the bandwidth will be dependent on the PCI target's hardware design and the architecture of the driver software.
A PCI burst transfer is defined by one PCI bus transaction with a signal address phase followed by two or more data phases. One may ask, how can the bus master transfer a data object on each PCI clock cycle? To initiate a bus transaction, the PCI master only has to arbitrate for ownership of the bus one time. The master then issues the start address and transaction type during the address phase. It's the responsibility of the target device to latch the start address into an address counter and increment the addressing from data phase to data phase. (A single-beat read or write transaction is defined by a signal address phase followed by only one data phase.)
For data to be transferred between the PowerPC Processor and the PCI Target, or for the PCI Target to transfer data between system memory, one of the following commands is initiated, as shown in Table 1.
With the basics of the PCI bus described and details of the Power Macintosh PCI implementation outlined, this should be ample background to describe the functionality of the IB chip. In particular, under what circumstance will it perform what type of PCI command?
If the PCI target's address space is set to write thru cache mode, the IB chip will perform an eight-beat burst read on PCI with the Memory Read Line command. This translates to a cache line, eight 4-byte long words, i.e. 32-bytes.
If the PCI target's address space is set to write back cache mode, the IB chip will perform an eight-beat burst write on PCI with the Memory Write and Invalidate command.
The PCI Memory Write and Invalidate command will perform an 8-beat transaction if the address is aligned on a 32-byte boundary.
The PCI Memory Read Line or Memory Read Multiple commands will perform an eight-beat transaction if the address is aligned to an address less than or equal to 8-bytes less than the next 32-byte boundary. The PCI Memory Read Line and Memory Read Multiple commands are treated the same by the IB chip, in either case the IB chip will disconnect after an eight-beat transaction -- one 32-byte cache line.
The numbers in Tables 2 and Table 3 are based on the following assumptions:
With this reference release OS, Apple starts to separate between APIs (Application Programming Services) and SPI (System Programming Services). In this present Mac OS release and the future direction, such as Copland, APIs and toolbox services are no longer available to driver SW. The Mac OS version 7.5.2 provides a DSL (Driver Services Library) that implements all SPI services available for drivers; documented in Designing PCI Cards and Drivers for Power Macintosh Computers, Chapter 9.
To coordinate I/O operations that transfer buffers between system memory and PCI address space, the Macintosh OS provides two functions with the DSL (Driver Services Library): PrepareMemoryForIO
, and CheckpointIO
. The PrepareMemoryForIO
function allocates resident system memory to buffers, provides logical and physical address information, and in conjunction with CheckpointIO
manages coherency between system memory and the PowerPC caches. CheckpointIO
is called after the buffer transfer is complete and either relinquishes the memory back to the OS and adjusts the processor caches for coherency, or prepares for another IO transfer.
PrepareMemoryForIO
should not be confused with PCIPrepareMemoryForIO
is an example of a service in the DSL; PCI cards that have DMA hardware should use PrepareMemoryForIO
to locate physical addresses in system memory. Older I/O expansion cards would typically use a toolbox call GetPhysical
to locate physical
addresses in system memory. To be fully compatible with the present and future Mac OS releases, drivers should only use SPI services. Again, this is fully
documented in Designing PCI Cards and Drivers for Power Macintosh Computers.
Remembering that PCI address space defaults to cache inhibit mode, to enable the PowerPC to burst to areas of PCI memory space, that area must be set
to cacheable setting. This can be done with the SetProcessorCacheMode
(see chapter 9 in Designing PCI Cards and Drivers for Power
Macintosh
Computers). Set the desired PCI address space to kProcessorCacheModeCopyBack
for cache line writes and
kProcessorCacheModeWriteThrough
for cache line reads.
SetProcessorCacheMode
has an undocumented limitation. The PowerPC address space is divided into sixteen 256-Mbyte segments that are distinguished by the upper 4-bits of the effective address. The SetProcessorCacheMode
is only capable of changing the cache setting for one contiguous section of memory per 256-Mbyte segment. Therefore, if two PCI cards are configured where they both have PCI address assignments in the same segment only one card can change its address space cache setting.
As an example, if two cards (card x and card y) have addresses mapped into segment 8, one at 0x80800000 and another at 0x80801000, the first call to SetProcessorCacheMode
from the driver of card x to make a cacheable address space in segment 8 will work. A second call, say from the driver of card y, to modify the cache setting in segment 8 will not work nor will it report an error. This scenario will most likely result in a lower than expected performance for card y, because card y address space is actually cache inhibited which disables PCI transactions of 32-byte cache lines. If the two cards are mapped into different segments, such as 8 and A, then they both can modify the cache settings within their perspective segments. This limitation will be relaxed in the future.
Extensions to the BlockMove
routine have been incorporated in the DSL that optimizes performance on the PowerPC CPU family. In particular, BlockMoveData
has been optimized for data that is cacheable and BlockMoveDataUncached
for data that is cache inhibited. The difference between the cached and uncached versions of these instructions is that, for BlockMoveData
, the PPC dcbz
instruction is used to avoid the logically unnecessary read of the destination cache blocks. BlockMoveDataUncached
does not use the dcbz
instruction because dcbz
is extremely slow for address space marked cache inhibited or cache write thru.
Table 4 lists the different BlockMove functions provided in the DSL
The difference between BlockMove
and BlockMoveData
versions is whether or not the block being moved contains 68K instructions. If the data does contain 68K instructions BlockMove
must be called which also flushes the DR (Dynamic Recompilation) Emulator's cache. This is costly time-wise, so if the block does not contain 68K instructions, be sure to use BlockMoveData
or BlockMoveDataUncached
. Also with performance in mind, when appropriate the BlockMove
routines will align the source and destination address to utilize floating-point load and store instructions.
To summarize the BlockMove
routines, for transfers of large buffers between PCI cards the MoveBlockData
or BlockMoveDataUncached
functions should be used, depending if the destination address space is marked write back cacheable or not. Furthermore, PCI drivers most likely will not need to consider the non-Data variant of the BlockMove
routines because destination buffers either in PCI address space or system memory will probably not need to execute 68K code.
A common question from PCI developers is, how to initiate a PCI burst of a cache line? Provided the PCI address space is marked cacheable as explained
earlier, the BockMoveData
function will force the IB chip to burst 32-byte cache lines -- eight-beat data phases per PCI command transaction.
To read or write PCI I/O space, the Expansion Bus Manager provides routines to transfer data -- byte, word, or long word (8, 16, or 32 bits, respectively) -- using PCI I/O Read and I/O Write commands. The Expansion Bus Manager is part of the ROM firmware in PCI Power Macintosh CPUs. These routines also perform appropriate byte swapping. For a further description, refer to Designing PCI Cards and Drivers for Power Macintosh Computers, chapter 10. PCI cards that are limited to I/O space, and do not incorporate PCI memory space, are limited to PCI I/O Read and I/O Write commands to transfer data between the PPC and PCI target. If PCI I/O data needs to be processed quickly, note there is a significant performance hit using Expansion Manager Routines. These routines are intended for PCI targets that have I/O registers or low bandwidth I/O buffers. The IB chip does not burst PCI I/O Read nor burst PCI I/O Write commands.
As described in chapter 10 of Designing PCI Cards and Drivers for Power Macintosh Computers along with sample code, the PCI property "assigned-addresses" provides vector entries that represent physical addresses on PCI cards. Using the "APPL,address" property a driver can locate a logical address of a physical I/O resource. By accessing the logical I/O address the IB chip will generate the appropriate PCI I/O command. Therefore a driver can generate PCI I/O commands without using the Expansion Bus Manager Routines; the same way it accesses PCI memory space. This provides the fastest way to access I/O space, but note it does not perform byte swapping as the Expansion Bus Manager routines.
Also note, the Expansion Bus Manager provides OS services to generate PCI Configuration Read, Configuration Write, Interrupt Acknowledge, and Special Cycle commands.